This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events in multichannel audio better when they are presented as separate layers of a volume. Using the proposed spatial features over monaural features on the same network gives an absolute F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and 2.7% on the TUT-SED 2009 dataset, which is fifteen times larger.
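The distinction between the two input arrangements can be sketched as follows. This is an illustrative NumPy example, not the authors' code; the shapes, feature names, and the choice of two feature types are hypothetical assumptions.

```python
import numpy as np

# Hypothetical setup: two feature types (e.g. log-mel energies and a spatial
# feature) extracted from a stereo recording, each giving a (frames x bins)
# time-frequency map per audio channel.
frames, bins, channels = 100, 40, 2
rng = np.random.default_rng(0)
mel = rng.random((channels, frames, bins))  # per-channel spectral feature
spa = rng.random((channels, frames, bins))  # per-channel spatial feature

# Arrangement A: concatenate all channels and feature types into one flat
# feature vector per frame, losing the 2-D time-frequency structure.
flat = np.concatenate(
    [mel.transpose(1, 0, 2).reshape(frames, -1),
     spa.transpose(1, 0, 2).reshape(frames, -1)],
    axis=1,
)
print(flat.shape)  # (100, 160)

# Arrangement B (the one favored in the abstract): keep each audio channel
# as a separate layer of a volume, so a CNN can convolve over the
# time-frequency plane while treating channels as depth.
vol_mel = mel.transpose(1, 2, 0)  # (frames, bins, channels)
vol_spa = spa.transpose(1, 2, 0)
print(vol_mel.shape)  # (100, 40, 2)
print(vol_spa.shape)  # (100, 40, 2)
```

In arrangement B, each feature type can then feed its own initial convolutional branch, matching the abstract's description of learning from each multichannel feature separately in the early stages.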